Abstract. Verifying a multi-threaded execution against the memory consistency model of a
processor is known to be an NP-hard problem. However, polynomial time algorithms exist that
detect almost all failures in such executions. These are often used in practice for
microprocessor verification. We present a low complexity and fully parallelized algorithm to
check program execution against the processor consistency model. In addition, our algorithm
is general enough to support a number of consistency models without any degradation in
performance. An implementation of this algorithm is currently used in practice to verify
processors in the post-silicon stage for multiple architectures.



1 Introduction 

Verifying processor execution against its stated memory consistency model is an important
problem in both design and silicon system verification. Verification teams for a
microprocessor are often concerned with the memory consistency model visible to external
customers such as system programmers. In the context of multi-threading, both in terms of
Simultaneous Multi Threading (SMT) and Chip Multi Processing (CMP), Intel® and other CPU
manufacturers are increasingly building complex processors and SMP platforms with a large
number of execution threads. In this environment the memory consistency model of
microprocessors will come under close scrutiny, particularly by developers of multi-threaded
applications and operating systems. Allowing any error in implementing the consistency model
to become customer visible is thus unacceptable. The problem we are concerned with is that of
matching the result of executing a random set of load and store memory operations,
distributed across processors and applied to a set of shared locations, against a memory
consistency model. The algorithm should flag an error if the consistency model does not allow
the observed execution results. This forms the basis for Random Instruction Test (RIT)
generators such as TSOTOOL [1] and Intel's Multi Processor (MP) RIT environment. The Intel MP
RIT tool incorporates the algorithm in this paper. Formally, we concentrate on variations of
the VSC (Verifying Sequential Consistency) problem [2]. The VSC problem is exactly the
problem described above, when restricted to sequential consistency. The general VSC problem
is NP-complete [3]. The general coherence problem has also been shown to be NP-complete [4].
A formulation of VSC for more general memory consistency models was done in [1], where a
polynomial time algorithm was presented for verifying a memory consistency model at the cost
of correctness, although the incorrect executions missed were shown to be insignificant for
the purpose of CPU verification. That work focused almost exclusively on the Total Store
Order (TSO) memory consistency model and presented a worst case $O(n^5)$ algorithm. In this
work, we present an efficient implementation of the basic algorithm in [1]. Our key
contribution is to reduce the worst case complexity to $O(n^4)$ for any memory consistency
model using $O(n^2)$ space. Although the work in [5] has reduced the complexity to $O(kn^3)$,
where k is the number of processors, that algorithm assumes the TSO memory consistency model
and does not generalize to other models. Our motivation for generalizing and improving it is
Intel's complex verification environment, where microprocessors support as many as five
different consistency models at the same time. The primary objectives of our algorithm design
are simplicity, performance and seamless extendibility of the implementation to any processor
environment, including the Itanium®. Another goal is enhanced support for debugging reported
failures, which is crucial to reducing time to market for complex multi processors.

The algorithm we have developed is currently implemented in Intel's in house ran- 
dom test generator and is used by both the IA-32 and Itanium verification teams. We 
also present scalability results and a processor bug that was caught by the tool using 
this algorithm. 



2 Memory Consistency 

Consider a set of processors each of which executes a stream of loads and stores. These 
are done to a set of locations shared across the processors. We are concerned with a 
global ordering of all the loads and stores, which when executed serially leads to the 
same result. The strictest consistency model is the sequential consistency (SC) model 
which insists that the only valid orderings are those that do not relax per processor pro- 
gram order between the memory operations. Relaxing restrictions between operations 
such as stores and loads leads to progressively weaker models such as Total Store Order 
(TSO) and Release Consistency (RC). All these are surveyed in [6]. We point out that in
these orderings we refer to load executions and store executions. A load is considered
performed (or executed) if no subsequent store to that location (on any processor) can
change the load return value. A store is considered performed (or executed) if any
subsequent load to that location (on any processor) returns its value. These are definitions
from [7]. Any instruction on a modern pipelined processor has a number of phases and
some, such as instruction fetch and retirement, occur in strict program order without 
regard to the memory consistency model. We are concerned only with ordering the load 
and store execution phases for instructions referring to memory. 

Itanium® is a trademark or registered trademark of Intel Corporation or its subsidiaries in the
United States and other countries.



2.1 Formalism 



The terminology used in this paper is similar to [1]. We use ; to denote program order
and $\le$ to denote global order. Thus $A ; B$ and $A \le B$ mean that B follows A in program
order and global order respectively. The fundamental operations in our test consist of
$L^i_a$ and $S^i_a$, which are loads and stores respectively to location a by processor i. We
also consider $[L^i_a ; S^i_a]$, which is an atomic load store operation. Examples are XCHG in
IA-32 and FETCHADD in Itanium [9]. We use $val(L^i_a)$ to denote the load return
value of a load operation and $val(S^i_a)$ to denote the value stored by a store operation.

For any location a we define the type of the location to be
$Type(a) \in \{WB, WT, WP, UC, WC\}$. The type of a location is the memory type
of the location. IA-32 [10] supports all five memory types: Write Back (WB), Write
Through (WT), Write Protect (WP), Write Combining (WC) and Uncacheable (UC). Itanium
[11] supports only three: WB, WC and UC. In addition to the cacheability and write through
implications of these memory types, they also affect the consistency model.

2.2 Axioms and Orders 

Both $\le$ (the global order) and ; (the program order) are transitive, reflexive and
antisymmetric orders. The program order is limited to operations on the same processor while
the global order covers all operations across all processors. We also define $A < B$ to mean
$A \le B$ and $A \ne B$. We define the following axiom to support atomic operations.

Axiom 1 (Atomic Operations) $[L^i_a ; S^i_a] \Rightarrow (L^i_a \le S^i_a) \wedge \forall S^j_b : (S^j_b \le L^i_a) \vee (S^i_a \le S^j_b)$

As a result of this, we can treat atomic operations as a single operation for verification.
We assume the following two axioms to hold, the bare minimum to be able to use the
basic algorithm proposed in [1].

Axiom 2 (Value Coherence) $val[L^i_a] \in \{\, val[\max_{\le} \{ S^k_a \mid S^k_a \le L^i_a \}],\ val[\max_{;} \{ S^i_a \mid S^i_a ; L^i_a \}] \,\}$

The value returned by a read is from either the most recent store in program order
or the most recent store in global order. This is intuitive for a cache coherent system.
Note that the most recent store in program order may not be a preceding store in global 
order. This is because many architectures including Intel ones can support the notion of 
store forwarding, which allows a store to be forwarded to local loads before it is made 
globally visible. Also, in the test a load may occur before any store to that location 
in which case it returns the initial value of that location. Such cases are handled by 
assuming a preliminary set of stores that write initial values to locations. The store 
values to a location and initial value of the location are chosen to be unique by the test
generator. This allows the axiom to be applied after the test is completed to link a load
to the store that it reads. 



Axiom 3 (Total Store Order) $\forall S^i_a, S^j_b : (S^i_a \le S^j_b) \vee (S^j_b \le S^i_a)$.



Unlike [1], we have avoided imposing any additional constraints between operations
on the same processor. Rather, we allow these constraints to be dynamically specified.
This allows us to parameterize the same algorithm to work across CPU architectures
(Itanium and IA-32) and processor generations (Intel NetBurst® and P6 in the case of
IA-32).

Define $Ops = \{L, S, X\}$ to be the allowed types of an operation. Thus we can define
$Type(L^i_a) = L$, $Type(S^i_a) = S$ and $Type([L^i_a ; S^i_a]) = X$. We also define $Loc(Op)$
to return the memory location used by the operation. For example $Loc(L^i_a) = a$.

We can then define the constraint function
$f : (Ops \times \{WB, WP, WT, WC, UC\})^2 \rightarrow \{0, 1\}$. This is used to impose the
dynamic set of constraints:

Definition 1 (Local Ordering). $[\,O_1 ; O_2$ and
$f((Type(O_1), Type(Loc(O_1))), (Type(O_2), Type(Loc(O_2)))) = 1\,] \Rightarrow O_1 \le O_2$

If the LHS of the implication is satisfied we call $O_1$ and $O_2$ locally ordered memory
operations.

As an example, from [10] we know that write back stores do not bypass each other.
Hence f((S, WB), (S, WB)) = 1. However, write combining stores are allowed to bypass
each other and hence f((S, WC), (S, WC)) = 0. There are other more subtle orderings
which vary between processor generations, and in this case we obtain appropriate ordering
functions from the CPU architects or designers.
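As an illustration of Definition 1, the constraint function can be held as a lookup table keyed by (operation type, memory type) pairs. The sketch below is only a minimal example under stated assumptions: the dictionary layout, the names `ORDERED`, `f` and `locally_ordered`, and the per-operation fields `cpu`, `seq`, `op` and `loc` are hypothetical, and only the two store/store entries are taken from the text.

```python
# Hypothetical rule table. Pairs that are not listed default to 0 (unordered).
# Only the two store/store entries below come from the text; a real table
# would be filled in from the architecture documentation.
ORDERED = {
    (("S", "WB"), ("S", "WB")): 1,  # write back stores do not bypass each other
    (("S", "WC"), ("S", "WC")): 0,  # write combining stores may bypass each other
}

def f(first, second):
    """Constraint function: 1 if `first` must stay ahead of `second`."""
    return ORDERED.get((first, second), 0)

def locally_ordered(o1, o2, mem_type):
    """Definition 1: o1 ; o2 in program order and f(...) = 1."""
    in_program_order = o1["cpu"] == o2["cpu"] and o1["seq"] < o2["seq"]
    key1 = (o1["op"], mem_type[o1["loc"]])
    key2 = (o2["op"], mem_type[o2["loc"]])
    return in_program_order and f(key1, key2) == 1

# Four stores issued in program order by one processor (illustrative data).
mem_type = {"a": "WB", "b": "WB", "c": "WC", "d": "WC"}
s1 = {"cpu": 0, "seq": 0, "op": "S", "loc": "a"}
s2 = {"cpu": 0, "seq": 1, "op": "S", "loc": "b"}
s3 = {"cpu": 0, "seq": 2, "op": "S", "loc": "c"}
s4 = {"cpu": 0, "seq": 3, "op": "S", "loc": "d"}

print(locally_ordered(s1, s2, mem_type))  # True
print(locally_ordered(s3, s4, mem_type))  # False
```

Keeping the rules as data rather than code is one way to swap consistency models without touching the checker itself.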

3 Algorithm 

Our objective is an algorithm that takes in the result of an execution and flags violations
of the memory consistency model. The basic algorithm in [1] that we extend uses constraint
graphs to model the execution. There have been similar approaches in the past
too, such as [12], and an approach to the same problem using constraint solvers [13].

We model the execution as a directed graph G = (V, E) where the nodes represent
memory operations and the edges represent the $\le$ global order. However, as in [1], we
do not put self edges although the relation is reflexive. Thus if $O_1 \le O_2$ then we add an
edge from the node for $O_1$ to that for $O_2$. For brevity, we refer to operations and their
corresponding nodes by the same name. $A \rightarrow B$ means there is an edge from A to B
while $A \rightarrow_p B$ means there is a path from A to B.

Based on the per processor ordering imposed by our ordering function /, we can 
immediately add static edges to the graph. 

Rule 1 (Static Edges) For every pair of nodes $O_1$ and $O_2$ such that they are locally
ordered by definition 1, add the edge $O_1 \rightarrow O_2$.

After execution of the test, we determine a function Reads in a preprocessing step
(operating on loads) such that $Reads(L^i_a) = S^j_a$ if $L^i_a$ reads $S^j_a$. Otherwise
(the case where the initial value for the location is read), $Reads(L^i_a) = Sentinel$, a
special sentinel node. We add edges from Sentinel to all other store nodes in the graph. This
is the same construction as described in [1]. From the value axiom we know that any
read that returns the value of a remote write must have occurred after the remote write
has been globally observed. This allows us to add observed edges to the graph based
on the values returned by the loads in the test. Note that for the rules below we treat an
atomic operation as both a load and a store.

Intel NetBurst® is a trademark or registered trademark of Intel Corporation or its subsidiaries
in the United States and other countries.

Rule 2 (Observed Edge) For every load $L^i_a$, if $Reads(L^i_a) = S^j_a$ where $i \ne j$, or if
$Reads(L^i_a) = Sentinel$, add the edge $Reads(L^i_a) \rightarrow L^i_a$. Note that since stores to
the same location write unique values and all locations are initialized to hold unique
values, value equivalence means that the load must have read that store.
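The Reads preprocessing step and Rule 2 can be sketched as follows. This is an illustrative example, not the tool's code: the operation fields (`cpu`, `kind`, `loc`, `val`), the `initial` map and the function names are assumptions, and the sketch indexes stores by (location, value) in a dictionary instead of scanning all load/store pairs.

```python
SENTINEL = "Sentinel"

def compute_reads(ops, initial):
    """Link each load to the store it read, relying on unique values per location."""
    store_by_value = {}
    for name, op in ops.items():
        if op["kind"] == "S":
            store_by_value[(op["loc"], op["val"])] = name
    reads = {}
    for name, op in ops.items():
        if op["kind"] != "L":
            continue
        key = (op["loc"], op["val"])
        if key in store_by_value:
            reads[name] = store_by_value[key]
        elif initial[op["loc"]] == op["val"]:
            reads[name] = SENTINEL           # the load saw the initial value
        else:
            raise ValueError(f"{name} returned a value that nobody wrote")
    return reads

def observed_edges(ops, reads):
    """Rule 2: edge Reads(L) -> L for remote stores and for the sentinel."""
    edges = set()
    for load, src in reads.items():
        if src == SENTINEL or ops[src]["cpu"] != ops[load]["cpu"]:
            edges.add((src, load))
    return edges

# Illustrative execution: one store, one local load, two remote loads.
ops = {
    "S0": {"cpu": 0, "kind": "S", "loc": "a", "val": 1},
    "L0": {"cpu": 0, "kind": "L", "loc": "a", "val": 1},  # reads the local store
    "L1": {"cpu": 1, "kind": "L", "loc": "a", "val": 1},  # reads the remote store
    "L2": {"cpu": 1, "kind": "L", "loc": "b", "val": 7},  # reads the initial value
}
initial = {"a": 0, "b": 7}

reads = compute_reads(ops, initial)
print(reads)
print(sorted(observed_edges(ops, reads)))
```

Note that the local read `L0` produces no observed edge, which matches the store forwarding remark made for the value axiom.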

The next few sets of edges are essentially inferred from the value axiom. Hence they are
called inferred edges.

Rule 3 (Inferred Edge 1) If $Reads(L^i_a) = S^j_a$ and $i \ne j$ then for every $S^i_a$ such that
$S^i_a ; L^i_a$ add the edge $S^i_a \rightarrow S^j_a$. This follows from the value axiom since the
alternative global order would mean the load should read the local store.

Rule 4 (Inferred Edge 2) If $Reads(L^i_a) = S^j_a$ then for every $S^k_a$ such that
$S^k_a \rightarrow_p L^i_a$ and $S^k_a \ne S^j_a$, add the edge $S^k_a \rightarrow S^j_a$. This follows
from the value axiom since the alternative global order would mean that the load should read
$S^k_a$.

Rule 5 (Inferred Edge 3) If $Reads(L^i_a) = S^j_a$ then for every $S^k_a$ such that
$S^j_a \rightarrow_p S^k_a$ add the edge $L^i_a \rightarrow S^k_a$. This follows from the value
axiom since the alternative global order would mean that the load should read $S^k_a$.

3.1 Basic Algorithm 

The basic algorithm described in [1] can now be summarized as follows:

1. Compute the Reads function in a preprocessing step.

2. Apply rule 1 to add all possible edges.

3. Apply rule 2 to add all possible edges.

4. Apply rules 3, 4 and 5.

5. If any edges were added in step 4 go back to step 4 else go to step 6.

6. Check the graph for cycles. If any are found, flag an error.

An example of this algorithm applied to an execution is shown in Figure 1. We use
the notation S[X]#V for write V to location X, and L[X]=V for read from location
X returns value V.

Computing the Reads function is $O(n^2)$ since we need to examine all pairs of loads
and stores. Steps 2 and 3 are of cost $O(n^2)$ since we examine all pairs of nodes. Step
4 involves determining the relationship $A \rightarrow_p B$ for $O(n)$ nodes. This costs $O(n^2)$
for each node (assuming a depth first search, as one of the obvious options) and hence
$O(n^3)$ overall. Since the fixed point iteration imposed by steps 4 and 5 may loop for
at most $O(n^2)$ iterations, adding one edge on each iteration, we have a worst case
complexity of $O(n^5)$. The detailed analysis is in [1]. There has been a subsequent
improvement published in [5] that reduces the complexity to $O(kn^3)$. Its correctness
requires that there are a constant number of ordered lists on each processor. This is true
because all loads and all stores are ordered on a processor in the TSO consistency model that
they have considered. Unfortunately this does not hold for either the IA-32 [10] or the
Itanium [14] memory model for various memory types (consider WC stores). Hence
the formulation in [5] is not general enough.
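The worst case bound of the basic algorithm can be written out term by term. The derivation below only restates the counts given in the text; the symbols $T_{\text{step 4}}$ and $T_{\text{basic}}$ are introduced here for illustration.

```latex
\begin{align*}
T_{\text{step 4}} &= \underbrace{O(n)}_{\text{nodes}} \cdot
                     \underbrace{O(n^2)}_{\text{depth first search per node}} = O(n^3), \\
T_{\text{basic}}  &= \underbrace{O(n^2)}_{\text{Reads, steps 2 and 3}}
                   + \underbrace{O(n^2)}_{\text{fixed point iterations}} \cdot T_{\text{step 4}}
                   = O(n^5).
\end{align*}
```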



3.2 Graph Closure 

The primary contributor to the $O(n^5)$ complexity is deciding whether $A \rightarrow_p B$
holds. All other operations can be efficiently implemented and do not seem to hold any
opportunity for improvement, given our goal of generality. Hence, we decided to focus on the
problem of efficiently determining $A \rightarrow_p B$. A solution is to compute the transitive
closure of the graph. We first label all the nodes in the directed graph under consideration,
G = (V, E), by natural numbers using the bijective mapping function $g : V \rightarrow \{1..n\}$
where |V| = n. We can then represent E by the familiar n × n adjacency matrix A such that
$(U, V) \in E \Leftrightarrow A[g(U), g(V)] = 1$. For transitive closure of the graph we seek a
closed form of the adjacency matrix A such that
$U \rightarrow_p V \Leftrightarrow A[g(U), g(V)] = 1$. A well known algorithm for computing the
transitive closure of a binary adjacency matrix is Warshall's algorithm [15]. Before giving
Warshall's algorithm, we first define some convenient notation and functions to transform the
connectivity matrix. AddEdge(x, y) stands for: set A[x, y] = 1. Subsume(x, y) is defined as:
$\forall z$ such that A[y, z] = 1, AddEdge(x, z). The subsume function causes all neighbors of
node $g^{-1}(y)$ to also become neighbors of node $g^{-1}(x)$ in the adjacency matrix
representation.

Fig. 1. Example of an incorrect execution with graph edges added



Warshall's Algorithm:

    for all j ∈ {1..N}
        for all i ∈ {1..N}
            if (A[i, j] = 1)
                Subsume(i, j)
            end if
        end for
    end for

Incremental Warshall's Algorithm:

    for all j ∈ {1..N}
        for all i ∈ {1..N}
            if (A[i, j] = 1 and (Changed[j] = 1 or Changed[i] = 1))
                Subsume(i, j)
            end if
        end for
    end for
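Warshall's algorithm, expressed through AddEdge and Subsume, can be sketched as below. This is a minimal illustration that uses a list-of-lists matrix and 0-based node numbers rather than the labels {1..N}; the function names simply mirror the notation above.

```python
def add_edge(A, x, y):
    """AddEdge(x, y): set A[x][y] = 1."""
    A[x][y] = 1

def subsume(A, x, y):
    """Subsume(x, y): every neighbor of y also becomes a neighbor of x."""
    for z in range(len(A)):
        if A[y][z] == 1:
            add_edge(A, x, z)

def warshall(A):
    """Close the adjacency matrix A in place and return it."""
    n = len(A)
    for j in range(n):          # j is the intermediate node
        for i in range(n):      # i is the node whose row may grow
            if A[i][j] == 1:
                subsume(A, i, j)
    return A

def has_cycle(A):
    """In a closed matrix a cycle shows up as a self loop."""
    return any(A[k][k] == 1 for k in range(len(A)))

# A chain 0 -> 1 -> 2 -> 3 and a two node cycle.
chain = warshall([[0, 1, 0, 0],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1],
                  [0, 0, 0, 0]])
loop = warshall([[0, 1],
                 [1, 0]])

print(chain[0])          # [0, 1, 1, 1]
print(has_cycle(chain))  # False
print(has_cycle(loop))   # True
```

With j as the outer loop, row j is not modified while the inner loop runs, which is the property the parallelization later relies on.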



Incremental Graph Closure: Although Warshall's algorithm will compute the closed
form of the adjacency matrix, any edge added by AddEdge will cause the matrix to
lose this property, since new paths may be available through the added edge. Hence we
need an algorithm which, when given a closed adjacency matrix and some added edges,
efficiently recomputes the closure.

We assume that when adding edges to node U, we mark that node as changed by 
setting the corresponding bit in the change vector Changed[g{U)] = 1. We can now 
rerun Warshall's algorithm restricted to only those nodes which have either changed 
themselves, or are connected in the current adjacency matrix to a changed node. This is 
shown in pseudo-code as incremental Warshall's algorithm. 
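The incremental update can be sketched as follows. One assumption is made explicit here: every call that actually adds an edge out of node x, including those made from inside Subsume, sets Changed[x]. Under that reading, a skipped Subsume(i, j) involves two rows that are still in their original closed state and would have added nothing.

```python
def add_edge(A, changed, x, y):
    """Add edge x -> y; mark x as changed only when the edge is new (assumption)."""
    if A[x][y] == 0:
        A[x][y] = 1
        changed[x] = 1

def subsume(A, changed, x, y):
    for z in range(len(A)):
        if A[y][z] == 1:
            add_edge(A, changed, x, z)

def incremental_warshall(A, changed):
    """Re-close A, visiting only pairs that touch a changed node."""
    n = len(A)
    for j in range(n):
        for i in range(n):
            if A[i][j] == 1 and (changed[j] == 1 or changed[i] == 1):
                subsume(A, changed, i, j)
    return A

# Two closed chains, 0 -> 1 and 2 -> 3, joined by the new edge 1 -> 2.
A = [[0, 1, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 1],
     [0, 0, 0, 0]]
changed = [0, 0, 0, 0]
add_edge(A, changed, 1, 2)
incremental_warshall(A, changed)

print(A[0])  # [0, 1, 1, 1]
print(A[1])  # [0, 0, 1, 1]
```

After the closure has been recomputed the caller would reset the Changed vector before applying the rules again.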

Correctness: The restricted Warshall's algorithm clearly terminates. Now, consider
any new path created as a result of the addition of edges to the graph,
$(U_1, U_2), (U_2, U_3), \ldots, (U_{m-1}, U_m)$. There is at least one edge $(U_i, U_{i+1})$ such
that $Changed[g(U_i)] = 1$. We need to show that $A[g(U_1), g(U_m)] = 1$ at termination. Since the
matrix was already closed, we can eliminate sub-paths consisting only of edges from
the original graph. The endpoints of these sub-paths would be connected in A. Thus
we can form a subset of the nodes on this path (in the same order) $V_1, V_2, V_3, \ldots, V_l$
where $\forall V_i$ ($i > 1$) either $Changed[g(V_i)] = 1$ or $Changed[g(V_{i-1})] = 1$. Also,
$\forall i$, $A[g(V_i), g(V_{i+1})] = 1$ and we have $U_1 = V_1$ and $U_m = V_l$.

Observe that for every $V_i$, if $Changed[g(V_i)] = 1$ then $Subsume(g(V_i), g(V_{i+1}))$
is called. Otherwise, if this is not the last node in the path, $Changed[g(V_{i+1})] = 1$ and
$A[g(V_i), g(V_{i+1})] = 1$. Hence, $Subsume(g(V_i), g(V_{i+1}))$ will always be called.

Using this observation, we can argue that we run Warshall's algorithm on a sub-graph
consisting only of the path $V_1, V_2, V_3, \ldots, V_l$ (since those are connected in the
adjacency matrix). As Warshall's algorithm is correct [15] we can conclude
$A[g(V_1), g(V_l)] = 1$ at termination. Since $V_1 = U_1$ and $V_l = U_m$ by construction, we
have $A[g(U_1), g(U_m)] = 1$.

It is trivial to show that the incremental update adds no incorrect edges, since
$A[i, j] = 1$ is a precondition to Subsume(i, j).

Complexity: An important observation is that the complexity of the incremental
update is $O(mn^2)$ where the number of changed nodes is $O(m)$. This is because the
subsume step takes $O(n)$ and, for each node, Subsume can only be called at worst
$O(m)$ times, if it is connected to all the changed nodes. At worst all $O(n)$ nodes satisfy
the precondition for subsume and hence the $O(mn^2)$ complexity.

3.3 Final Algorithm: 

We describe algorithms to implement the rules for adding observed and inferred edges
in Table 1. Recall that our graph is G = (V, E) and the vertices correspond to memory
operations in the test. Also, for ease of specification we have allowed atomic read modify
write operations to be treated as both stores, Type(Op) = S, and loads, Type(Op) = L.
The ordering of the for loops is not arbitrary as it may appear but rather has been carefully
chosen to aid in parallelization, as we demonstrate in section 4.

We now state the final algorithm used to verify the execution results. A benefit of our
approach is that checking the graph for cycles is simply checking whether
$\exists i : A[i, i] = 1$, since a cycle results in a self loop due to the closure. Additionally,
note that we have merged the preprocessing step that links loads to the stores they read
into the step to compute observed edges.

Algorithm for adding edges:

Static Edges:

    for all O1 ∈ V
        for all O2 ∈ V such that O1 ≠ O2
            If O1 is locally ordered after O2 as per definition 1 then AddEdge(g(O2), g(O1))
        end for
    end for

Observed Edges:

    for all O1 ∈ V such that type(O1) = L
        for all O2 ∈ V such that type(O2) = S
            If Loc(O1) = Loc(O2) and val(O1) = val(O2)
                set Reads(O1) = O2
                If O2 is on a different CPU from O1 then AddEdge(g(O2), g(O1))
            end If
        end for
        If no corresponding store is found for this load then AddEdge(g(Sentinel), g(O1))
            and set Reads(O1) = Sentinel
    end for

Inferred Edge 1:

    for all O1 ∈ V such that type(O1) = L
        for all O2 ∈ V such that type(O2) = S and Loc(O2) = Loc(O1) and O2 ; O1 and O2 ≠ Reads(O1)
            If Reads(O1) is on a different CPU from O1 then AddEdge(g(O2), g(Reads(O1)))
                and set Changed[g(O2)] = 1
        end for
    end for

Inferred Edge 2:

    for all O1 ∈ V such that type(O1) = L
        for all O2 ∈ V such that type(O2) = S and Loc(O2) = Loc(O1) and A[g(O2), g(O1)] = 1
                and O2 ≠ Reads(O1)
            AddEdge(g(O2), g(Reads(O1))) and set Changed[g(O2)] = 1
        end for
    end for

Inferred Edge 3:

    for all O1 ∈ V such that type(O1) = S
        for all O2 ∈ V such that type(O2) = L and Loc(O2) = Loc(O1) and A[g(Reads(O2)), g(O1)] = 1
            AddEdge(g(O2), g(O1)) and set Changed[g(O2)] = 1
        end for
    end for

Table 1. Pseudocode of Algorithm for Adding Edges

1. Apply rule 1 to add all possible edges.

2. Apply rule 2 to add all possible edges.

3. Apply Warshall's algorithm to obtain the closed adjacency matrix.

4. Apply rules 3, 4 and 5.

5. If any edges were added in step 4 go to step 6 else go to step 8.

6. Apply the incremental Warshall's algorithm to recompute closure and reset the changed
vector.

7. Go to step 4.

8. Check the graph for cycles. If any are found, flag an error.
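The eight steps can be put together in a compact end-to-end sketch. It is illustrative only and rests on several assumptions: operations are plain dictionaries with hypothetical fields `cpu`, `seq`, `kind`, `loc` and `val`; initial values are distinct from all stored values; the local ordering function is passed in as `ordered`; every newly added edge marks its source node as changed; and the two ordering functions `sc` and `tso_like` are simplified stand-ins rather than real architectural rule sets.

```python
SENTINEL = "Sentinel"

def check_execution(ops, ordered):
    """Return True when no cycle is found, False when a violation is flagged."""
    names = [SENTINEL] + list(ops)
    g = {name: k for k, name in enumerate(names)}   # node labelling
    n = len(names)
    A = [[0] * n for _ in range(n)]
    changed = [0] * n

    def add_edge(x, y):
        if A[x][y] == 0:
            A[x][y] = 1
            changed[x] = 1          # assumption: every new edge marks its source

    def close(incremental):
        for j in range(n):
            for i in range(n):
                if A[i][j] and (not incremental or changed[i] or changed[j]):
                    for z in range(n):              # Subsume(i, j)
                        if A[j][z]:
                            add_edge(i, z)

    loads = [o for o in ops if ops[o]["kind"] == "L"]
    stores = [o for o in ops if ops[o]["kind"] == "S"]

    def same_cpu(p, q):
        return ops[p]["cpu"] == ops[q]["cpu"]

    def program_order(p, q):
        return same_cpu(p, q) and ops[p]["seq"] < ops[q]["seq"]

    for p in ops:                                   # step 1: static edges
        for q in ops:
            if program_order(p, q) and ordered(ops[p], ops[q]):
                add_edge(g[p], g[q])

    reads = {}                                      # step 2: Reads and observed edges
    for s in stores:
        add_edge(g[SENTINEL], g[s])
    for ld in loads:
        src = next((s for s in stores
                    if ops[s]["loc"] == ops[ld]["loc"]
                    and ops[s]["val"] == ops[ld]["val"]), SENTINEL)
        reads[ld] = src
        if src == SENTINEL or not same_cpu(src, ld):
            add_edge(g[src], g[ld])

    close(incremental=False)                        # step 3

    while True:                                     # steps 4 to 7
        changed[:] = [0] * n
        for ld in loads:
            src = reads[ld]
            for s in stores:
                if s == src or ops[s]["loc"] != ops[ld]["loc"]:
                    continue
                remote = src != SENTINEL and not same_cpu(src, ld)
                if remote and program_order(s, ld):     # rule 3
                    add_edge(g[s], g[src])
                if A[g[s]][g[ld]]:                      # rule 4
                    add_edge(g[s], g[src])
                if A[g[src]][g[s]]:                     # rule 5
                    add_edge(g[ld], g[s])
        if not any(changed):
            break
        close(incremental=True)

    return not any(A[k][k] for k in range(n))       # step 8

def op(cpu, seq, kind, loc, val):
    return {"cpu": cpu, "seq": seq, "kind": kind, "loc": loc, "val": val}

# Each CPU stores to one location, then loads the other and sees the initial 0.
sb = {"S0": op(0, 0, "S", "A", 1), "L0": op(0, 1, "L", "B", 0),
      "S1": op(1, 0, "S", "B", 1), "L1": op(1, 1, "L", "A", 0)}
sc = lambda first, second: True
tso_like = lambda first, second: not (first["kind"] == "S" and second["kind"] == "L")

print(check_execution(sb, tso_like))  # True
print(check_execution(sb, sc))        # False
```

The same observed execution is accepted when stores may pass later loads and flagged when every program order pair is enforced, which illustrates how the ordering function alone selects the consistency model.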



3.4 Complexity 



The analysis of complexity is straightforward. Each of steps 1 and 2 takes $O(n^2)$ since
they examine all pairs of nodes. Step 3 takes $O(n^3)$ as is shown in [15]. Each iteration of
step 4 again takes $O(n^2)$ because we examine all pairs of nodes. Note that checking
$A \rightarrow_p B$ is now $O(1)$ thanks to the closed adjacency matrix. There are at most
$O(n^2)$ edges to be added and hence the worst case complexity for step 4 is $O(n^4)$. The
remaining analysis is step 6. For this we note that the complexity is also $O(mn^2)$ when
considered over all invocations. Since $m = O(n^2)$ (bounded above by the number of edges we
can possibly add and thereby change nodes), we have $O(n^4)$ as the worst case complexity for
step 6. Cycle checking in step 8 is simply $O(n)$ due to the closed form of the adjacency
matrix. Thus the overall complexity is $O(n^4)$, which meets our stated goal. Our overall space
requirements are clearly $O(n^2)$ due to the adjacency matrix.

Fig. 2. Example of an actual processor bug



4 Parallelization 

One of the ways to mitigate the expense of an $O(n^4)$ algorithm is parallelization. With
a test size of hundreds of memory operations per CPU, result validation time can easily
overwhelm the verification process. For example, consider a 4 way SMP platform
with hyperthreaded processors with a total of 8 threads and hence, at a hundred operations
per thread, 800 operations. The way we have arranged the algorithm and data structures allows
us to easily do loop parallelization [16].



The phases of the algorithm are Warshall's algorithm, incremental graph closure and the
rule algorithms given in section 3.3. The key observation is that in each case we always have
no more than two nested for loops and there are no data dependences between iterations of the
inner loop. The latter is true because no two iterations change the same node in the graph
and hence never write to the same element in the adjacency matrix. We are not worried about
considering edges added in previous iterations of the inner for loop of step 4 (of the
algorithm in section 3.3) because such edges are considered in subsequent iterations, since we
iterate to a fixed point. Also, the same element in the Changed vector is not accessed by two
different inner loop iterations. Hence we can parallelize by distributing different
iterations of the inner for loop in each step across processors. Since each inner for loop
iterates over all nodes in the graph, this leads to a convenient data partitioning. We
allocate each CPU running the verification algorithm a disjoint subset of nodes in the graph.
Each CPU executes the inner for loop in each phase only on nodes that it owns. Note that each
CPU still needs to synchronize with all other CPUs after completion of the inner for loop in
each case (this is similar to the INDEPENDENT FORALL construct in High Performance Fortran).

Algorithm PrintSomeCycle:

    PossibleStart = {g^{-1}(i) | A[i, i] = 1}
    while PossibleStart is not empty
        StartNode = any node in PossibleStart
        PossibleStart = PossibleStart - {StartNode}
        CurrentList = {g^{-1}(i) | A[i, i] = 1} - {StartNode}
        GetCycleEdge(StartNode, StartNode)
    end while

Function GetCycleEdge:

    GetCycleEdge(node Start, node Current)
        If Algorithm(Current, Start) returns true
            print edge (Current, Start)
            PossibleStart = PossibleStart - {Current}
            return true
        end If
        for each node nextNode in CurrentList
            If Algorithm(Current, nextNode) returns true
                CurrentList = CurrentList - {nextNode}
                If GetCycleEdge(Start, nextNode) returns true
                    print edge (Current, nextNode)
                    PossibleStart = PossibleStart - {Current}
                    return true
                end If
            end If
        end for
        return false

Fig. 3. Debug Algorithm
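The data partitioning used for parallelization in this section can be sketched for the Warshall phase as follows. This is a structural illustration only: it uses Python threads from the standard library, which will not give a real speedup in CPython and which differ from the process and shared memory model used by the tool. The names `parallel_warshall`, `owned` and `inner_loop` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_warshall(A, workers=4):
    """Close A in place; each worker owns a disjoint set of rows (nodes)."""
    n = len(A)
    owned = [range(w, n, workers) for w in range(workers)]

    def inner_loop(rows, j):
        # Only rows owned by this worker are written. Row j is only read,
        # and Subsume(j, j) cannot change it, so there are no conflicts.
        for i in rows:
            if A[i][j] == 1:
                row_i, row_j = A[i], A[j]
                for z in range(n):
                    if row_j[z] == 1:
                        row_i[z] = 1

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for j in range(n):                      # outer loop stays sequential
            # list() waits for every worker: the barrier after each inner loop.
            list(pool.map(lambda rows: inner_loop(rows, j), owned))
    return A

chain = parallel_warshall([[0, 1, 0, 0],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1],
                           [0, 0, 0, 0]], workers=2)
print(chain[0])  # [0, 1, 1, 1]
```

The same ownership scheme carries over to the incremental closure and to the rule loops, since each inner iteration there also writes to a single row.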



5 Implementation 

Intel's verification environment spans both architecture validation (pre-silicon on RTL
models) as well as extensive testing post-silicon with the processor in an actual platform
[17]. The algorithm described in this paper has been implemented in an Intel RIT generator,
used by verification teams across multiple Intel architectures (Itanium, IA-32 and
64-bit IA-32). Although in the architecture validation (pre-silicon on RTL simulators)
environment direct visibility into load and store execution allows simpler tools to be
built, it has been used in a limited fashion to generate tests that are subsequently run on
RTL simulators. The results are then checked by the algorithm to find bugs. The greatest
success of the tool has been in the post-silicon environment, where the execution
speed available (compared to RTL simulations) allows the tool to quickly run a large
number of random tests and discover memory ordering issues on processors. In Figure 2
we show an example of an incorrect execution corresponding to an actual bug found
by this tool. The problem was subsequently traced to incorrect design in the CPU of the
locking primitive for certain corner cases.

In the Post Silicon environment the tool has been written to run directly on the De- 
vice Under Test(DUT). This was made possible by running it as a process on a device- 
less Linux kernel which is booted on the target. The primary advantage of this model 
is speed and adaptability where the RIT tool directly detects its underlying hardware, 
generates and executes the appropriate tests and then verifies the result with no commu- 
nication overhead. Another not so apparent but important advantage is scaling. As we 
anticipate future processors to increase the number of available threads, the tool scales 
seamlessly by not only running tests on the increased number of threads but also using 
all available threads to run the checking algorithm itself. This is also the reason why 
we have paid so much attention to parallelization in this work. That is to allow the al- 
gorithm to bootstrap on future generations of multi threaded processors. We point out 
here that the test generation phase is also parallelized in the tool to make optimal use of 
resources and achieve the best speedup. 

Implementation Environment: The algorithm is implemented in C and architecture 
dependent assembly that runs on a scaled down version of the Linux kernel. We have 
chosen to use the Linux process model (avoiding other threading models for simplicity) 
with shared memory segments for inter process communication. We have hand parallelized
the loops using the data distribution concepts described in section 4. This allows
us to use off the shelf compilers such as those in standard Linux distributions and work 
across all the platforms that Linux supports. 

Exploiting SIMD: The key kernel used in the iterative phase of our algorithm is
Subsume. This is called at least once for every edge added to the graph and improving
its performance is clearly beneficial. The implementation for Subsume(x, y) is
$\forall z \in \{1..n\} : A[x, z] = A[x, z] \vee A[y, z]$. Another way of looking at it is as
the logical 'OR' of two binary vectors, $A[x, .] = A[x, .] \vee A[y, .]$. This could have taken
as many as n operations in the most obvious implementation, but we instead chose to use Single
Instruction Multiple Data (SIMD) extensions available in both the IA-32 [8] and
Itanium [9] instruction sets. These enable us to perform the subsume operation up to 128
bits at a time, providing up to a 128 times speedup to the implementation of Subsume. This
is also the only place in our tool where we have IA-32 and Itanium specific verification
code. The option to use SIMD to speed up the algorithm is really a consequence of the
carefully selected data structures and the time consuming graph manipulations being
reduced to a single well defined kernel.
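The row-wise OR view of Subsume can be illustrated with bit vectors. In the sketch below each row of the adjacency matrix is packed into one Python integer, so a single `|=` plays the role that wide SIMD registers play in the tool; the arbitrary-precision integer is a stand-in, not the SIMD code itself, and the helper names are hypothetical.

```python
def make_rows(n, edges):
    """Pack the adjacency matrix into n integers; bit y of rows[x] means x -> y."""
    rows = [0] * n
    for x, y in edges:
        rows[x] |= 1 << y
    return rows

def subsume(rows, x, y):
    """A[x, .] = A[x, .] OR A[y, .] as one word-parallel operation."""
    rows[x] |= rows[y]

def warshall_bitset(rows):
    n = len(rows)
    for j in range(n):
        bit_j = 1 << j
        for i in range(n):
            if rows[i] & bit_j:
                subsume(rows, i, j)
    return rows

def has_cycle(rows):
    """A cycle appears as bit i set in row i of the closed matrix."""
    return any(rows[i] >> i & 1 for i in range(len(rows)))

rows = warshall_bitset(make_rows(4, [(0, 1), (1, 2), (2, 3)]))
print([bin(r) for r in rows])  # ['0b1110', '0b1100', '0b1000', '0b0']
print(has_cycle(rows))         # False
```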

Extendibility: We support multiple architectures in our implementation by having as 
much architecture independent code as possible. This means we need to only recompile 



the tool to target different architectures. In addition we have made the tool independent 
of the memory consistency model it is verifying by taking as input to the tool a description
of the local ordering rules, as described in definition 1, in a standard format rulefile.
This allows us to verify different consistency models (Itanium and different generations 
of IA-32) and adapt to changes in the consistency models that may happen in the future. 

Debug Support: A critical requirement in CPU verification is that failures should be 
root caused to bugs as soon as possible. Ease of debugging failures is very important 
in all of Intel's verification methodologies. A failure in our case is a cycle in the graph. 
The problem with our algorithm formulation is that the final cycle is detected only in 
terms of which nodes are participating in the cycle. There is no way to determine from 
the closed form adjacency matrix what is the ordering of nodes in the cycle. Also the 
nature of the basic algorithm often leads to more than one cycle in long tests. To work 
around this problem without sacrificing algorithm efficiency we use a backtracking algorithm,
described in Figure 3, that prints all the detected cycles. The only change we
need to make to the algorithm described in section 3.3 is that it takes as parameter an
edge e. Whenever the AddEdge function adds the edge e during execution of the algo- 
rithm we return true indicating that this edge is actually added by one of the rules in the 
algorithm. We also return the reason for addition of this edge which allows all edges to 
be labelled with the corresponding rule, a good aid to debug. Note that the backtrack 
though costly is only run in case of failure which should be rare. 
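For readers who want to experiment with cycle reporting, the sketch below shows a simpler alternative to the backtracking scheme, and it is not the method used in the tool: it assumes that every directly added edge was recorded together with the rule that produced it, and then runs a depth first search restricted to the nodes that have a self loop in the closed matrix. The names `find_cycle`, `direct` and `on_cycle` are hypothetical.

```python
def find_cycle(direct, on_cycle):
    """direct: {(u, v): rule label}; on_cycle: nodes i with A[i][i] = 1.

    Returns one cycle as a list of (u, v, rule) triples, or None.
    """
    succ = {}
    for (u, v) in direct:
        if u in on_cycle and v in on_cycle:
            succ.setdefault(u, []).append(v)

    def dfs(start, current, visited, path):
        for nxt in succ.get(current, []):
            if nxt == start:
                return path + [(current, nxt)]
            if nxt not in visited:
                visited.add(nxt)
                found = dfs(start, nxt, visited, path + [(current, nxt)])
                if found:
                    return found
        return None

    for start in sorted(on_cycle):
        cycle = dfs(start, start, {start}, [])
        if cycle:
            return [(u, v, direct[(u, v)]) for u, v in cycle]
    return None

# Edges of a small violating graph, labelled with the rule that added them.
direct = {
    ("S_A", "S_B"): "static",
    ("S_B", "L_B"): "observed",
    ("L_B", "L_A"): "static",
    ("L_A", "S_A"): "inferred 3",
}
cycle = find_cycle(direct, {"S_A", "S_B", "L_B", "L_A"})
for u, v, rule in cycle:
    print(f"{u} -> {v} ({rule})")
```

Recording edges costs extra memory during the normal run, which is the trade-off the backtracking approach avoids by paying only on failure.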



6 Performance and Scaling 



Fig. 4. Algorithm Performance (panel (a): Algorithm Complexity with 8 threads)



We include some performance data to support our claims of efficient algorithm design. In
figure 4(a) we show how the cost of running the algorithm grows with increasing
number of nodes. Clearly the algorithm scales well. In figure 4(b) we show how the
speedup increases when we use more processors to run the algorithm while keeping the
problem size (number of graph nodes) the same. The near-linear speedup (linear being ideal)
indicates that the parallelization decisions have been correctly made and load balance the
problem well among different processors. All the presented scalability data was taken
on an 8 way 1.2 GHz Intel® Xeon® processor platform running Linux.

7 Limitations 

Although our algorithm is general enough to cover the memory consistency models we need to
check for at Intel, it nevertheless has certain limitations and assumptions that we point
out here.

We assume that all stores in the test to the same location write unique values. Thus we are
never in a position where we need to reconcile a load with multiple stores for rule 2.

The algorithm assumes store atomicity, which is necessary for Axiom 3. However it supports
slightly relaxed consistency models which allow a load to observe a local store which
precedes it in program order, before it is globally observed. Thus we cover all coherence
protocols that support the notion of relaxed write atomicity, which can be defined as: No
store is visible to any other processor before the execution point of the store. Based on our
discussion with Intel microarchitects we determined that all IA-32 and current generations of
Itanium microprocessors support this due to identifiable and atomic global observation points
for any store. This is mostly due to the shared bus and single chipset. For Itanium we can
still adapt to the case where stores are not atomically observed by other processors by
checking only store releases [14]. Another approach is to split stores into one for each
observing processor and appropriately modify rule 2. This would lead to a worst case
degradation of checking performance by a factor equal to the number of processors.

Fig. 5. Edge missed by the algorithm

Last, the algorithm does approximate checking only (since it is a polynomial time
solution to an NP-hard problem). It does not completely check for Axiom 3 since it
does not attempt to order all stores and thereby find additional inferred edges which
could lead to a cycle. An example taken from [1] is shown in Figure 5. The algorithm is unable
to deduce the ordering from S[A]#6 to S[A]#5 although that is the only possibility
given that the loads to location B read different values. Adding a similar mirrored set
of nodes, two stores to location C before S[A]#6 and two loads from location C after
S[A]#5, gives an example violation of the TSO model which is missed by this algorithm.
However, we hypothesize that only a small fraction of bugs actually lead to such cases
and these are ultimately found by sufficient random testing which will show them up in
a form the algorithm can detect. This is well borne out in practice and another reason
why we place so much emphasis on test tool performance.

Intel® Xeon® is a trademark or registered trademark of Intel Corporation or its subsidiaries in
the United States and other countries.

8 Conclusion 

We have described an algorithm that does efficient polynomial time memory consis- 
tency verification. Our algorithm meets its stated goals of efficiency and generality. It is 
implemented in a tool that is used across multiple groups in Intel to verify increasingly 
complex microprocessors. It has been appreciated across the corporation for finding a 
number of bugs that are otherwise hard to find and point to extremely subtle flaws in 
implementing the memory consistency model. We hope to work further in decreasing 
the cost of the algorithm by studying the nature of the graphs generated and considering
more fine grained parallelization opportunities.